Redundant arrays of distributed disks (RADD) can be used 
in a distributed computing system or database system to provide 
recovery in the presence of disk crashes and temporary and 
permanent failures of single sites. In this paper, we look at the 
problem of partitioning the sites of a distributed storage system 
into redundant arrays in such a way that the communication 
costs for maintaining the parity information are minimized. 
We show that the partitioning problem is NP-hard. We then 
propose and evaluate several heuristic algorithms for finding 
approximate solutions. Simulation results show that significant 
reduction in remote parity update costs can be achieved by 
optimizing the site partitioning scheme. © Academic Press, Inc.


1. INTRODUCTION 

Redundant disk arrays are used to provide reliable
storage while increasing the I/O bandwidth
in high performance systems [1, 2]. Redundant disk arrays 
can also be used in a distributed setting to increase avail- 
ability in the presence of temporary site failures, disk fail- 
ures, or major disasters. Stonebraker and Schloss have 
proposed the redundant arrays of distributed disks 
(RADD) scheme [3] as an alternative to multicopy 
schemes, which are much more costly in terms of storage 
requirements. Cabrera and Long [4] have proposed the 
use of redundant distributed disk striping in a high speed 
local area network to support such I/O-intensive applications
as scientific visualization, image processing, and recording
and playback of color video. The RADD concept 
can also be used in multicomputer I/O subsystems such as 
the one proposed by Reddy and Banerjee [5] for hypercubes.

The information dispersal algorithm (IDA) approach proposed
by Rabin [6] provides another way to tolerate failures in
distributed storage systems with limited extra storage cost.
However, in that approach, updates are more costly since all
the fragments of the dispersed data are needed to recompute
the encoding, which involves multiple remote accesses. In the
case of RADD, a local update will generate a single remote
access for updating the parity. 

1 This research was supported in part by the National Aeronautics and 
Space Administration (NASA) under Contract NAG 1-613 and in part 
by the Department of the Navy and managed by the Office of the Chief 
of Naval Research under Grant N00014-91-J-1283. 

2 This work was performed while the first author was at the Coordinated 
Science Laboratory, University of Illinois. 

When RADDs are used, sites are grouped together to 
form a redundant array containing data and parity and 
capable of recovering from a single site failure. The size 
of each array is fixed and is determined by the tradeoff 
between the availability requirements of the system and 
the cost of the storage overhead. Hence, a large distributed 
data storage system may have to be divided into several 
arrays of fixed size. In this paper we look at the problem 
of partitioning the distributed storage system into fixed- 
size arrays in such a way as to minimize the cost of remote 
accesses that have to be performed to update the parity 
information. This problem is somewhat related to the prob- 
lem of file allocation and replica placement in a distributed 
system, which has been studied extensively in the literature 
[7, 8]. However, the two problems are different in nature 
because, in the RADD case, there is one redundant item 
for N data items while in the file allocation problem each 
file is replicated several times. More importantly, in the 
replica placement problem there is no stringent constraint 
on the number of sites “sharing” a replica because, when 
the replica becomes unavailable, those sites can access the 
second nearest replica, whereas in the RADD case there is a 
hard constraint on the number of sites in an array. Note 
that the assignment of sites to redundant arrays (parity 
groups) can occur after all decisions on placing the data 
have been made. Data placement decisions are governed 
by a different set of criteria and are more influenced by 
the read access patterns since reads are usually more fre- 
quent than updates. Decisions on site assignment to redun- 
dant arrays are based on the update rate at each site and 
the cost of communication between sites and are indepen- 
dent of the read access rate. Changing the assignment of 
sites to redundant arrays does not change the placement 
of the data. The purpose of site assignment is to reduce 
the cost of the parity traffic and does not directly affect 
the data traffic. 

In the following section, we describe the RADD organi- 
zation. In Section 3, we present the model used to formu- 
late the problem mathematically and we prove that the 
problem is NP-hard. In Section 4, heuristic algorithms for 
solving the problem are described and results from an 
experimental evaluation are presented. In Section 5 we 
develop heuristics with guaranteed bounds on the devia- 
tion from the optimal cost. In Section 6 we address the 
issue of hot spots and non-uniform site capacity and discuss 
the use of RADD for disaster recovery in OLTP systems 
as well as the issue of when and how often site reassignment 
should be initiated. 

2. DISTRIBUTED REDUNDANT DISK ARRAY 
ORGANIZATION 

The RADD organization is shown in Fig. 1. The data 
at each site are partitioned into blocks. Data blocks from 
different sites are grouped into a block parity group. The 
bitwise parity of the data blocks in each parity group is 
computed and written at a different site. In Fig. 1, $D_{ij}$
denotes a data block, $P_i$ denotes a parity block, and $S_i$
denotes a spare block, all at site $i$. The number under block 
in the first column of the figure denotes the physical block 
number on disk. Each row in the figure represents a parity 
group. The position of the parity block is rotated among 
the sites in order to avoid creating a bottleneck at the site 
where parity is stored. For every update to one of the data 
blocks in the parity group, the parity block needs to be 
updated using the following formula: 

$$P_{\mathrm{new}} = (D_{\mathrm{old}} \oplus D_{\mathrm{new}}) \oplus P_{\mathrm{old}}.$$

Spare blocks are provided to make it possible to recon- 
struct data blocks that become inaccessible due to site 
failure. The failed data block is reconstructed by XORing 
all other data blocks and the parity block in its parity 
group. If K denotes the number of data blocks per parity 
group then N = K + 2 denotes the number of sites in a 
distributed disk array. The storage overhead for the parity 
and spare blocks required by RADDs is (200/K)% compared
to a 100% overhead for the case of two-copy schemes. 
In terms of performance, both approaches require one 
remote access per update, while the RADD scheme may 
require two additional local accesses per update to read 
the old data and old parity in order to compute the new 
parity. Under failure, RADD will perform much worse 
than the two-copy scheme because it requires K remote 
accesses for reconstructing a data block from a failed site. 
However, if failures are expected to be rare, the performance
degradation associated with RADD may be justifiable
in light of its significant savings in terms of storage 
costs in comparison with the two-copy scheme. 
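
The parity update formula and the reconstruction rule described in this section can be illustrated with a minimal Python sketch. The function names and the 4-byte toy blocks are our own illustrative assumptions and are not part of the RADD proposal; each block is simply modeled as a `bytes` object.

```python
from functools import reduce

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))

def parity_of(blocks):
    """Parity block of a parity group: XOR of all its data blocks."""
    return reduce(xor_blocks, blocks)

def update_parity(p_old, d_old, d_new):
    """P_new = (D_old xor D_new) xor P_old; only old data and old parity are read."""
    return xor_blocks(xor_blocks(d_old, d_new), p_old)

def reconstruct(surviving_data, parity):
    """Rebuild the block of a failed site from the surviving data blocks and the parity."""
    return reduce(xor_blocks, surviving_data, parity)

# K = 3 data blocks held at three different sites (made-up contents).
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = parity_of(data)

# A local update at the second site triggers one remote parity update.
new_block = b"\xff\x00\xff\x00"
parity = update_parity(parity, data[1], new_block)
data[1] = new_block

# The site holding data[2] fails; its block is rebuilt from the others.
recovered = reconstruct([data[0], data[1]], parity)
```

The update touches a single remote block, whereas `reconstruct` needs every surviving block of the parity group, which is the source of the K remote accesses under failure.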

3. THE MODEL 

We model the distributed computing system as an undirected
connected graph $G = (V, E)$, where $V$ is the set 
of sites and each edge $e \in E$ represents a bidirectional 
communication link between two sites. For each $e \in E$, 
$w_e$ denotes the cost of communication over link $e$. We 
assume that if $n$ is the number of sites in $V$ then $n = mN$ 
for some integer $m$. We assume that the site capacity is 
uniform. In Section 6.2 we show how to deal with nonuniform
site capacity. In the pattern shown in Fig. 1, the parity 
blocks of the $N - 2$ data blocks from site $i$ reside on sites 
$(i + 1) \bmod N$ through $(i + N - 2) \bmod N$. If the same 
pattern is repeated throughout the range of blocks then 
there will be no parity update traffic from site $i$ to site 
$(i - 1) \bmod N$. In order to make the problem symmetrical 
and thus easier to tackle, we assume that for the next set 
of N blocks the pattern shown in Fig. 2 is used. In all, there 
are N - 1 such patterns obtained by changing the distance 
between the parity block and the spare block on a given 
row. These N - 1 patterns should alternate throughout 
the range of blocks so that update traffic from a given site 
is distributed over the remaining N - 1 sites. This will also 
provide more load balancing for the parity update traffic 
in the array. 

Let $\mu_v$ designate the rate of update accesses to data 
blocks at site $v$. Each update will cause communication 
between the site where the update took place and the site 
holding the parity for the given data block. At each site 
the set of data blocks that have their corresponding parity 
blocks on the same remote site is called a data group. To 
simplify the model, we assume that the $N - 1$ data groups 
share equally the update rate. This implies that the rate 
at which site $v$ sends parity update information to each 
other site in its redundant array is $\lambda_v = \mu_v/(N - 1)$. This 
assumption is supported by the fact that consecutive data 
blocks have their parity blocks on different sites, which 
implies that accesses to a heavily used file that is stored 
on consecutive disk blocks will be spread over different 
data groups. In Section 6, the above assumption will be re- 
moved. 

The problem of partitioning the sites into arrays of size 
N in such a way that parity update costs are minimized 
can be mathematically formulated as follows: 
FIG. 2. Alternative placement pattern for parity and spare blocks. 

Problem 1 (SP). Find a partition of $V$ into $m$ disjoint 
subsets $V_1, V_2, \ldots, V_m$ of size $N$ such that if $d(u, v)$ denotes 
the length of the shortest path between $u$ and $v$ then 

$$\sum_{i=1}^{m} \sum_{u \in V_i} \lambda_u \sum_{v \in V_i - \{u\}} d(u, v)$$

is minimum. 
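
To make the objective concrete, the following sketch evaluates the cost of a given partition. It is only an illustration under our own assumptions: the helper names, the 4-site ring, and the update rates are hypothetical, and Floyd-Warshall is used here simply as one standard way to obtain $d(u, v)$.

```python
import math

def all_pairs_shortest_paths(n, edges):
    """Floyd-Warshall on an undirected graph; edges maps (u, v) -> link cost w_e."""
    d = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for (u, v), w in edges.items():
        d[u][v] = min(d[u][v], w)
        d[v][u] = min(d[v][u], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def sp_cost(partition, lam, d):
    """Sum over arrays V_i, sites u in V_i and v in V_i - {u} of lam[u] * d[u][v]."""
    return sum(lam[u] * d[u][v]
               for block in partition
               for u in block
               for v in block
               if v != u)

# Hypothetical instance: a 4-site ring with unit link costs, N = 2, m = 2.
edges = {(0, 1): 1, (1, 2): 1, (2, 3): 1, (3, 0): 1}
mu = [6.0, 2.0, 4.0, 8.0]          # update rates mu_v (made up)
N = 2
lam = [x / (N - 1) for x in mu]    # lambda_v = mu_v / (N - 1)
d = all_pairs_shortest_paths(4, edges)

adjacent = sp_cost([[0, 1], [2, 3]], lam, d)   # neighbouring sites share an array
opposite = sp_cost([[0, 2], [1, 3]], lam, d)   # sites two hops apart share an array
```

On this toy instance, pairing neighbouring sites halves the parity update cost compared with pairing sites that are two hops apart.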

Theorem 1. Problem SP is NP-hard for any fixed 
$N \ge 3$. 

Proof . We prove that problem SP is NP-hard by show- 
ing that there is a polynomial time transformation from 
the problem of partitioning a graph into cliques of size N 
to problem SP. The partition into cliques of size N (PC) 
problem can be stated as follows: 

Instance. A graph $G = (V, E)$, with $|V| = Nm$ for some 
positive integer $m$. 

Problem. Is there a partition of $V$ into $m$ disjoint
subsets $V_1, V_2, \ldots, V_m$ such that the subgraph of $G$ induced by 
each $V_i$ is a clique of size $N$ (complete graph with $N$ nodes)? 

PC is NP-complete for any fixed $N \ge 3$ (see partition 
into isomorphic subgraphs [9]). To transform an instance 
of PC into an instance of SP, it is sufficient to set $\lambda_v = 1$ 
for all $v \in V$, and $w_e = 1$ for all $e \in E$. Then graph $G$ can 
be partitioned into cliques of size $N$ if and only if the cost 
of the optimal solution to the above instance of problem 
SP is $n(N - 1)$. ■ 
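
The reduction can be checked on a small example. With unit loads and unit link costs, every pair of distinct sites is at distance at least 1, so a cluster contributes exactly $N(N - 1)$ only when it is a clique. The sketch below uses a hypothetical 6-node graph (two triangles joined by one edge) and helper names of our own choosing.

```python
from collections import deque

def bfs_distances(n, edges):
    """Hop-count shortest paths (all link costs w_e = 1)."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = []
    for s in range(n):
        row = [None] * n
        row[s] = 0
        queue = deque([s])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if row[y] is None:
                    row[y] = row[x] + 1
                    queue.append(y)
        dist.append(row)
    return dist

def unit_cost(partition, dist):
    """SP cost with lambda_v = 1 for every site."""
    return sum(dist[u][v] for block in partition for u in block for v in block if u != v)

# Two triangles {0,1,2} and {3,4,5} joined by the edge (2, 3); N = 3, n = 6.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
n, N = 6, 3
dist = bfs_distances(n, edges)

clique_partition = [[0, 1, 2], [3, 4, 5]]   # both blocks are cliques
other_partition = [[0, 1, 3], [2, 4, 5]]    # neither block is a clique

clique_cost = unit_cost(clique_partition, dist)
other_cost = unit_cost(other_partition, dist)
```

Here the clique partition attains the value $n(N - 1) = 12$, while the other partition costs more.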

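The cost function $\sum_{i=1}^{m} \sum_{u \in V_i} \lambda_u \sum_{v \in V_i - \{u\}} d(u, v)$ can 
be rewritten as 

$$\sum_{i=1}^{m} \sum_{\{u, v\} \subseteq V_i,\, u \neq v} (\lambda_u + \lambda_v)\, d(u, v) = \sum_{i=1}^{m} \sum_{\{u, v\} \subseteq V_i,\, u \neq v} D(u, v),$$

where the inner sums range over unordered pairs of distinct
sites and $D(u, v)$ is defined as
$D(u, v) = (\lambda_u + \lambda_v) d(u, v)$. In this form the general problem is 
reduced to a uniform load problem with the pseudo-distance
$D$ replacing $d$. However, $D$ is not a true distance since 
it does not necessarily satisfy the triangle inequality. 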
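
A small numerical example shows both the equivalence of the two forms of the cost function and a violation of the triangle inequality by $D$. The three-site line and the load values below are hypothetical and chosen only for illustration.

```python
def pseudo_distance(lam, d, u, v):
    """D(u, v) = (lambda_u + lambda_v) * d(u, v)."""
    return (lam[u] + lam[v]) * d[u][v]

# Three sites on a line a - b - c with unit link costs (made-up values).
a, b, c = 0, 1, 2
d = [[0, 1, 2],
     [1, 0, 1],
     [2, 1, 0]]
lam = [10, 1, 10]    # the two end sites are update-heavy, the middle one is not

direct = pseudo_distance(lam, d, a, c)                                  # (10 + 10) * 2
via_b = pseudo_distance(lam, d, a, b) + pseudo_distance(lam, d, b, c)   # 11 + 11

# Cost of the array {a, b, c} computed from both forms of the cost function.
sites = [a, b, c]
cost_lambda = sum(lam[u] * d[u][v] for u in sites for v in sites if u != v)
cost_D = sum(pseudo_distance(lam, d, u, v)
             for i, u in enumerate(sites) for v in sites[i + 1:])
```

Since `direct` exceeds `via_b`, going through the lightly loaded middle site is "shorter" under $D$ than the direct pseudo-distance, which is exactly what a true metric forbids.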

4. APPROXIMATION ALGORITHMS 

4.1. Description of the Heuristics 

The first heuristic is based on a greedy strategy that 
consists of satisfying first the sites with the largest update 
rate. Let A be the list of update rates for all sites. When 
sites are grouped into clusters (redundant arrays) their 
update rates are removed from A and replaced by a single 
update rate for the cluster. The cluster update rate is the 
average update rate of the sites in the cluster. 

Algorithm 1 

Step 1. Select the largest value in A and let a be the 
corresponding site (or cluster). Find the site (or cluster) b 
such that merging a and b results in the smallest increase 
in the cost function. Merge the two sites (or clusters) if 
the resulting cluster has at most N sites and the total 
number of clusters does not exceed m. If the clusters cannot 
be merged, find the next best choice for b and repeat. 
Remove the update rates of the merged sites (or clusters) 
from A and replace them with the cluster update rate. 


Step 2. Repeat Step 1 until m clusters having N sites 
each have been formed. 

The computational cost of Algorithm 1 is $O(Nn^2)$. However,
it requires that an all-pairs shortest path algorithm be
performed first, which requires $O(n^3)$ operations. 

The second approach consists of two stages: in the first 
stage m sites are identified to be used as cluster seeds and 
in the second stage the remaining sites are allocated to the 
clusters to form m subsets of N sites each. 

Algorithm 2 

Step 1. Select the two sites with the largest distance 
between them and include them in the set S of cluster seeds. 

Step 2. Select the site v with the largest average dis- 
tance to the sites already in S and add it to S. 

Step 3. Repeat Step 2 above until $|S| = m$. Each cluster 
initially contains one of the $m$ seeds in $S$. 

Step 4. For each of the m clusters, compute the average 
update rate of the sites in the cluster. In decreasing order 
of their average update rate, allocate to each cluster the 
site that is closest to it in terms of the pseudo-distance D. 

Step 5. Repeat Step 4 above until all sites have been 
allocated to the m clusters. 

We use the pseudo-distance metric D in Step 4 because 
it provides the actual increase in the cost function of a 
cluster when a node is added to it. The computational cost 
of Algorithm 2 is $O(Nn^2)$. It also requires that the
all-pairs shortest path algorithm be performed first. 
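
The steps of Algorithm 2 can be sketched as follows. This is a minimal reading of the description above rather than a definitive implementation: we assume that the "closeness" of a free site to a cluster is the sum of its pseudo-distances $D$ to the cluster members (that is, the increase in cost), that ties are broken by site index, and that `d` is a precomputed all-pairs shortest-path matrix. The function name and the toy instance are hypothetical.

```python
def algorithm2(d, lam, N):
    """Seed-then-allocate sketch. d: shortest-path matrix, lam: per-site rates."""
    n = len(d)
    if n % N != 0:
        raise ValueError("the number of sites must be a multiple of N")
    m = n // N
    sites = range(n)

    # Steps 1-3: pick m seeds that are far apart (using the plain distance d).
    if m == 1:
        seeds = [0]
    else:
        u0, v0 = max(((u, v) for u in sites for v in sites if u < v),
                     key=lambda p: d[p[0]][p[1]])
        seeds = [u0, v0]
        while len(seeds) < m:
            rest = [v for v in sites if v not in seeds]
            seeds.append(max(rest, key=lambda v: sum(d[v][s] for s in seeds) / len(seeds)))
    clusters = [[s] for s in seeds]
    free = set(sites) - set(seeds)

    def D(u, v):
        return (lam[u] + lam[v]) * d[u][v]

    # Steps 4-5: in rounds, each cluster takes its closest free site,
    # clusters with the highest average update rate choosing first.
    while free:
        order = sorted(clusters, key=lambda c: sum(lam[u] for u in c) / len(c), reverse=True)
        for cluster in order:
            if len(cluster) < N and free:
                best = min(free, key=lambda v: (sum(D(u, v) for u in cluster), v))
                cluster.append(best)
                free.remove(best)
    return clusters

# Hypothetical instance: six sites on a line, forming two natural groups.
pos = [0, 1, 2, 10, 11, 12]
d = [[abs(p - q) for q in pos] for p in pos]
lam = [1.0] * 6
clusters = algorithm2(d, lam, N=3)
```

On this instance the seeds are the two extreme sites, and each round hands every cluster its nearest remaining site, so the two natural groups are recovered.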

The third approach is based on the hierarchical clustering
technique [10]. We use the distance matrix whose entries
are $d(u, v)$ for all $u, v \in V$. Clusters are formed by 
merging together sites or smaller clusters that are close 
to each other. When two sites (or clusters) are grouped 
together, the distance matrix is modified by eliminating 
the columns and rows corresponding to the merged sites 
(or clusters) and replacing them with a single column and 
a single row reflecting the average distance between the 
merged sites and other sites (or clusters). The procedure 
is as follows: 

Algorithm 3 

Step 1. Find the smallest entry in the distance matrix 
and merge the two sites (or clusters) together if the resulting
cluster has N sites or less and if the total number 
of clusters does not exceed m. If any of the latter conditions 
is not satisfied, select the next smallest entry and repeat. 
Once two sites (or clusters) have been merged, update the 
distance matrix and the number of clusters accordingly. 

Step 2. Repeat Step 1 above until m clusters having N 
sites each have been formed. 

The complexity of Algorithm 3 is $O(n^3)$. 

TABLE I
Comparison between Approximate Solutions and the Optimal Solution

$K_w$, $K_\lambda$   Random         Algorithm 1    Algorithm 2    Algorithm 3    Exhaustive
1000, 10       69967 (1157)   53853 (972)    54428 (975)    53732 (963)    48678 (870)
100, 100       67477 (1126)   51606 (950)    52623 (941)    52064 (940)    46761 (848)
10, 1000       98427 (1247)   77964 (1061)   78284 (1046)   77949 (1045)   70741 (931)

After an initial partition has been found, the following 
procedure may be used to improve it. 

Procedure Improve 

Step 1. Select the site $u$ with the highest update rate. 
For each site $v$ outside site $u$'s partition, compute the 
change in cost $\Delta C(u, v)$ if $u$ and $v$ were swapped. Let $v^*$ 
be the site corresponding to the minimum change in cost: 
$\Delta C(u, v^*) = \min_{v \notin V_u} \Delta C(u, v)$, where $V_u$ denotes $u$'s
partition. If $\Delta C(u, v^*) < 0$ then swap $u$ and $v^*$. 

Step 2. Repeat Step 1 above for all sites in V in decreas- 
ing order of their update rate. 

The complexity of the above procedure is $O(n^3)$. The 
procedure may be repeated several times to improve the 
total cost. The procedure could be repeated until a local 
minimum of the cost function was reached. However, it is 
not guaranteed that such a local minimum will be reached 
in finite time. The procedure can also be employed as the 
basic move in metaheuristics, such as simulated annealing 
[11] or tabu search [12], that avoid getting trapped in a 
local minimum. 
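
One pass of Procedure Improve can be sketched as below. For clarity this version recomputes the full cost for every tentative swap, so it is slower than the complexity quoted above; an incremental evaluation of the change in cost would be needed to match it. The function names and the toy instance are our own assumptions.

```python
def cost(partition, lam, d):
    """Parity update cost of a partition (same objective as problem SP)."""
    return sum(lam[u] * d[u][v] for blk in partition for u in blk for v in blk if u != v)

def improve_pass(partition, lam, d):
    """One pass over all sites in decreasing order of update rate."""
    part = [list(blk) for blk in partition]
    by_rate = sorted((s for blk in part for s in blk), key=lambda s: lam[s], reverse=True)
    for u in by_rate:
        home = next(i for i, blk in enumerate(part) if u in blk)
        base = cost(part, lam, d)
        best_delta, best_swap = 0, None
        for j, blk in enumerate(part):
            if j == home:
                continue
            for v in blk:
                # Tentatively swap u and v, measure the change in cost, then undo.
                iu, iv = part[home].index(u), blk.index(v)
                part[home][iu], blk[iv] = v, u
                delta = cost(part, lam, d) - base
                part[home][iu], blk[iv] = u, v
                if delta < best_delta:
                    best_delta, best_swap = delta, (j, iu, iv)
        if best_swap is not None:          # swap only if the cost strictly decreases
            j, iu, iv = best_swap
            v = part[j][iv]
            part[home][iu], part[j][iv] = v, u
    return part

# Hypothetical instance: six sites on a line, starting from a poor partition.
pos = [0, 1, 2, 10, 11, 12]
d = [[abs(p - q) for q in pos] for p in pos]
lam = [1.0] * 6
start = [[0, 1, 3], [2, 4, 5]]
better = improve_pass(start, lam, d)
```

On this instance a single swap (sites 3 and 2) is enough to reach the partition into the two natural groups.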

4.2. Experimental Evaluation 

We have conducted experiments to evaluate the approx- 
imate solutions obtained using the heuristics and to com- 
pare the three proposed approaches for site assignment. 
In the experiments, we used randomly generated graphs. 
The distance on each edge in the graph was drawn from 
a uniform distribution over the interval $[1, K_w]$. The update 
rates at each site were drawn from a uniform distribution 
over the interval $[1, K_\lambda]$. 

In our experiments we found that Algorithm 2 performs
better when the pseudo-distance $D$ is also used in 
the first stage of the algorithm. This can be explained by 
the fact that using D in the generation of the cluster seeds 
ensures that edges with large D(u, v) will not be used 
within a cluster, i.e., sites that have large loads and that 
are far apart are not placed in the same cluster. The results 
shown here for Algorithm 2 were obtained using D instead 
of d. 

In the first experiment, we compare the approximate 
solution provided by the heuristics to the optimal solution. 
The optimal solution was obtained using exhaustive search. 
N was taken to be equal to 5 and n equal to 15. Table I 
shows the results for three situations: one where the edge 
weights vary more widely than the site loads, one where 
both are picked from the same interval, and one where 
the site loads vary more widely than the edge weights. 
Each entry represents the average over 1000 randomly 
generated graphs. The costs of the approximate solutions 
are within roughly 10% to 13% of the cost of the optimal
solution. In the column labeled Random, we have listed the
cost of a random solution. For each number the half-width
of the corresponding 95% confidence interval is shown in
parentheses. 
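
The exhaustive search used as the baseline can be sketched as follows. This is an assumed enumeration scheme, not necessarily the one used in the experiments: it generates every partition into unordered blocks of exactly $N$ sites once and keeps the cheapest. The number of such partitions is $n!/((N!)^m\, m!)$, which is 126,126 for $n = 15$ and $N = 5$.

```python
from itertools import combinations
from math import factorial

def equal_partitions(sites, N):
    """Yield every partition of `sites` into unordered blocks of exactly N sites."""
    if not sites:
        yield []
        return
    first, rest = sites[0], sites[1:]
    # Fixing the first remaining site in the next block avoids duplicates.
    for mates in combinations(rest, N - 1):
        block = (first,) + mates
        remaining = [s for s in rest if s not in mates]
        for tail in equal_partitions(remaining, N):
            yield [block] + tail

def count_partitions(n, N):
    """n! / ((N!)^m * m!) with m = n / N."""
    m = n // N
    return factorial(n) // (factorial(N) ** m * factorial(m))

def exhaustive_optimum(d, lam, N):
    """Brute-force minimum of the SP cost; only practical for small n."""
    def cost(partition):
        return sum(lam[u] * d[u][v] for blk in partition for u in blk for v in blk if u != v)
    return min(equal_partitions(list(range(len(d))), N), key=cost)

# Hypothetical instance: six sites on a line, N = 3.
pos = [0, 1, 2, 10, 11, 12]
d = [[abs(p - q) for q in pos] for p in pos]
best = exhaustive_optimum(d, [1.0] * 6, 3)
```

The rapid growth of `count_partitions` with $n$ is the reason the comparison with the optimum is restricted to small instances.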

Since, in the first experiment, an exhaustive search was 
used to find the optimal solution, the number of nodes n 
could not be very large. In a second experiment, we com- 
pared the performance of the three heuristics for larger 
values of n. Figure 3 shows the results for the second 
experiment. For clarity of the figure, we plotted the cost 
of the approximate solution divided by 1000. For each data 
point, the 95% confidence interval is shown. In the case 
N = 10, Algorithm 3 outperforms Algorithms 1 and 2 for 
all values of $n$ except $n = 20$, in which case Algorithm 2 
performs better. For the first and second environments, 
Algorithm 1 outperforms Algorithm 2 for large values of 
$n$, but for the last environment Algorithm 2 outperforms 
Algorithm 1. For N = 5, Algorithm 2 does not do very 
well except in the last environment in which the range of 
site loads is much larger than the range of edge weights. 
Algorithm 1 performs best in the first two environments. 
The main point that can be deduced from this experiment 
is that, in spite of the fact that Algorithm 3 does not use 
any information about site loads, it outperforms the other 
two algorithms when n and N are relatively large and, in 
the other cases, its performance is always close to that of 
the best algorithm. This means that, in a large system, it 
is more important to minimize the sum of the edge weights 
within each cluster than to use the greedy approach that 
attempts to assign to the sites with large loads their nearest 
neighbors. Since site loads vary with time, the solution 
found by Algorithm 3 will remain close to optimal as the 
site loads change while solutions based on estimates of site 
loads will degrade with time as the site loads deviate from 
the estimates. This is especially true for large N. A large 
value for N means lower storage costs but also lower relia- 
bility and worse performance under failure. 

5. HEURISTICS WITH PERFORMANCE GUARANTEES 

The heuristics described in Section 4 provide a good 
approximate solution. However, there is no guarantee that 
the approximate solution will not diverge significantly from 
the optimal one in certain cases. In this section, we seek 
to find a heuristic that has a bound on the error between 
the approximate solution and the optimal one. We develop 
such a heuristic first for the case of a system with balanced 
load, $\lambda_v = \lambda$ for all $v \in V$, and uniform edge weights; then 
we look at the more general case of a balanced load system 
with arbitrary edge weights. Since a problem with arbitrary 
site loads can always be transformed into a problem with 
uniform site load as shown in Section 3, the heuristic 
for the balanced load case with arbitrary edge weights 
will also provide performance guarantees for the arbitrary 
load case. 

FIG. 3. Comparison between the three heuristics. 

5.1. Balanced Load and Uniform Edge Weights 

The heuristic requires the use of a spanning tree with 
many leaves. The problem of finding a spanning tree with 
a maximum number of leaves is NP-hard [9]; however, 
there exist polynomial time algorithms for generating
spanning trees with many leaves. Typically these methods
guarantee that a certain fraction of the nodes will be leaves. 
The fraction of leaves is a function of the minimum degree 
k of the graph. Kleitman and West proved the following 
result [13]: 

Theorem 2 (Kleitman-West). If $k$ is sufficiently large, 
then there is an algorithm that constructs a spanning tree 
with at least $(1 - b \ln k / k)n$ leaves in any graph with 
minimum degree $k$, where $b$ is any constant exceeding 2.5. 

It was also conjectured that a spanning tree can be
constructed with a larger fraction of leaves. More specifically, 
Linial conjectured that the number of leaves could be at 
least $\frac{k - 2}{k + 1} n + c_k$. This stronger result was proved 
for $k = 3$ with $c_3 = 2$ and for $k = 4$ with $c_4 = 8/5$ [13]. 

FIG. 4. Example of a tree partitioned using the procedure
Partition-Tree. 

Algorithm 

Step 1. Find a spanning tree with many leaves. 

Step 2. Partition the spanning tree into m clusters of 
N nodes each using procedure Partition-Tree described 
below. 

The partition found for the tree will be used as the 
approximate solution for the partitioning problem in the 
original graph. We first describe a basic version of the 
procedure Partition-Tree which ensures that every edge 
in the tree is used by at most two clusters. Then we describe 
an optimization that reduces the cost in the tree of the 
solution but that is not needed to establish the bound on 
the cost of the heuristic solution. In the description of 
the procedure Partition-Tree, we assume that the tree is 
levelized starting from the root. Figure 4 shows an example 
of a tree partitioned using this procedure. 

Procedure Partition-Tree 

The procedure partitions the tree from the bottom up 
and from left to right. As the clusters are built, whenever 
the size of a cluster reaches N nodes, that cluster is removed 
from the tree. Starting from the deepest leaf of the leftmost 
branch in the tree, the leaf is assigned to the first cluster. 
After a node has been assigned to a cluster, its sibling to 
the right is considered next. If no siblings are left to the
right of the node then the parent is assigned next. If the sibling 
to the right is a leaf, it is included in the cluster, otherwise 
the leftmost branch of the subtree rooted at that sibling is 
followed to its deepest leftmost leaf and that leaf is in- 
cluded in the cluster. Then the procedure continues from 
that point moving to the right sibling (if any) or to the 
parent in the same fashion. When a node is to be assigned, 
it is either assigned to the current cluster if that cluster 
has not reached N nodes or a new cluster is formed and 
the node assigned to it. The tree remains connected as 
newly completed clusters are removed. 
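
Under our reading of the description above, the order in which nodes are assigned (children from left to right, then the parent) is a post-order traversal of the tree, and the clusters are consecutive runs of $N$ nodes in that order. The sketch below implements this reading; the function name, the `children` dictionary representation, and the 9-node example tree are hypothetical. Node labels are assumed not to be `None`.

```python
def partition_tree(children, root, N):
    """Chunk the post-order sequence of a rooted, ordered tree into clusters of N nodes.

    children[v] lists the children of v from left to right; leaves may be omitted.
    """
    order = []
    stack = [(root, iter(children.get(root, [])))]
    while stack:
        node, it = stack[-1]
        child = next(it, None)
        if child is None:
            order.append(node)      # all children handled: the node itself is assigned
            stack.pop()
        else:
            stack.append((child, iter(children.get(child, []))))
    return [order[i:i + N] for i in range(0, len(order), N)]

# Hypothetical tree with n = 9 nodes, root 0, and N = 3.
children = {0: [1, 2], 1: [3, 4, 5], 2: [6, 7], 6: [8]}
clusters = partition_tree(children, 0, 3)
```

In this example the second cluster consists of node 1 together with nodes 8 and 6 from the subtree to its right, which illustrates how a cluster can span two subtrees. The traversal visits each node once, in line with a linear-time procedure.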

Theorem 3. The cost (HEU) of the approximate solution
found using a spanning tree with many leaves and the 
cost (OPT) of the optimal solution satisfy the following
relationship: 

$$\frac{\mathrm{HEU}}{\mathrm{OPT}} \le 2\alpha + (1 - \alpha)\frac{N^2}{N - 1},$$

where $\alpha$ is the fraction of leaves in the spanning tree. 

Proof We need to establish an upper bound on the 
cost of the approximate solution and a lower bound on 
that of the optimal one. The cost in the graph of the approx- 
imate solution is at most the cost of that solution in the 
tree. We evaluate the cost in the tree by adding up the 
contributions of each edge in the spanning tree to the 
overall cost. If an edge connects a leaf node to the tree it 
will be referred to as a leaf edge, otherwise it will be called 
an internal edge. A leaf edge will be used in only one 
cluster and it will be used only for communication between 
the leaf node and the other $N - 1$ nodes in the cluster. 
Therefore the contribution of a leaf edge to the overall 
cost is $2(N - 1)$. An internal edge will be used in at most 
two clusters and in each cluster it will be used by $i$ nodes 
to communicate with the other $N - i$ nodes in the cluster,
contributing $2i(N - i) \le N^2/2$. If $\alpha$ designates the
fraction of leaf nodes in the tree, we have 

$$\mathrm{HEU} \le \alpha n \times 2(N - 1) + (n - 1 - \alpha n) \times 2 \times \max_{1 \le i \le N - 1} 2i(N - i) \le n(N - 1)\left(2\alpha + (1 - \alpha)\frac{N^2}{N - 1}\right).$$

For the cost of the optimal solution, an obvious lower 
bound is the cost in a complete graph, which is $n(N - 1)$. 
Hence, $\mathrm{HEU}/\mathrm{OPT} \le 2\alpha + (1 - \alpha)N^2/(N - 1)$. ■ 
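
The behaviour of this bound as the leaf fraction grows can be tabulated with a few lines of Python. The function name is our own, and the sample values of the leaf fraction are chosen only for illustration.

```python
def theorem3_bound(alpha, N):
    """Upper bound 2*alpha + (1 - alpha) * N^2 / (N - 1) on HEU/OPT."""
    if not 0.0 <= alpha <= 1.0 or N < 2:
        raise ValueError("need 0 <= alpha <= 1 and N >= 2")
    return 2 * alpha + (1 - alpha) * N ** 2 / (N - 1)

# Bound for N = 5 at three illustrative leaf fractions.
no_leaves = theorem3_bound(0.0, 5)    # N^2 / (N - 1)
half_leaves = theorem3_bound(0.5, 5)
all_leaves = theorem3_bound(1.0, 5)   # limiting value as alpha tends to 1
```

For $N = 5$ the bound falls from 6.25 to 2 as the leaf fraction goes from 0 to 1, which is why spanning trees with many leaves are preferred.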

As stated in Theorem 2, for large $k$, $\alpha$ converges to 1 and 
the upper bound approaches 2. Note that it is reasonable to 
assume that the minimum degree will be large in practice 
because the underlying network has to have sufficient con- 
nectivity to enable communication under node and link 
failures, and hence, has to have a reasonably large mini- 
mum degree. 

The complexity of the algorithms for generating trees 
with many leaves [13] is $O(|E|)$. The complexity of the 
Partition-Tree procedure is $O(n)$. 

There is an optimization to procedure Partition-Tree 
that reduces the cost of the solution in the tree (not neces- 
sarily the cost in the original graph) by reducing the num- 
ber of tree edges that are used by two clusters. It can be 
described as follows: Consider the case where one subtree 
has been processed and there remains an incomplete clus- 
ter (less than N nodes) and assume that there is a subtree 
rooted at the sibling to the right. Procedure Partition-Tree 
would complete the cluster using the lowest leftmost nodes 
of the right subtree. However, if the right subtree is deep, 
a number of intermediate edges will contribute to the cost 
of communicating (in the tree) between the two components
of the newly completed cluster. This can be avoided 
by successively removing from the right subtree complete 
clusters formed by connected branches of exactly N nodes 
whose removal does not disconnect the tree and then
completing the cluster from the left subtree. After those
clusters are removed Partition-Tree proceeds as described 
above to complete the cluster with what remains of the 
right subtree. The requirement that the removed clusters 
be connected ensures that edges remaining in the right 
subtree are not used by any of the removed clusters. Those 
remaining edges may then be used both by the cluster 
formed with nodes remaining from the left subtree and by 
a cluster formed with the remaining nodes in the right 
subtree. 

5.2. Balanced Load and Arbitrary Edge Weights 

For arbitrary edge weights the problem of finding a heu- 
ristic with guaranteed performance bounds is much harder. 
In the following we describe a heuristic for which a worst 
case performance bound can be established. The bound is 
more significant for systems where link communication 
costs (edge weights) do not vary widely. The heuristic 
consists of finding a minimum spanning tree, partitioning 
the tree into clusters using procedure Partition-Tree and 
using that partition as an approximate solution. The follow- 
ing result will be used to establish a lower bound on the 
cost of the optimal solution. 

Lemma 1. In a complete graph, the average weight of 
the edges in a minimum spanning tree is at most the average 
weight of all edges. 

Proof. We use induction on the number of nodes $n$. 
The lemma is obviously true for $n = 2$ or $n = 3$. Suppose 
it is true for graphs with $n - 1$ nodes and consider an
$n$-node graph. Select node $v$ such that the average weight of 
edges incident on $v$ is at least the average weight of all 
edges in the graph. Remove $v$ from the graph and find a 
minimum spanning tree in the remaining $(n - 1)$-node 
graph. Then add to this spanning tree the lightest edge $e^*$ 
connecting $v$ to the other nodes to form an $n$-node spanning 
tree. Let $\mathrm{MST}_{n-1}$ and $\mathrm{MST}_n$ be the total weights of the 
$(n - 1)$-node and the $n$-node spanning trees, respectively. 
Let $E(v)$ be the set of edges incident on $v$. Using the 
induction hypothesis, we have 

$$\frac{\mathrm{MST}_{n-1}}{n - 2} \le \frac{\sum_{e \in E - E(v)} w_e}{(n - 1)(n - 2)/2}.$$

Therefore 

$$\mathrm{MST}_n \le \mathrm{MST}_{n-1} + w_{e^*} \le \frac{\sum_{e \in E - E(v)} w_e}{(n - 1)/2} + \frac{\sum_{e \in E(v)} w_e}{n - 1} = \frac{\sum_{e \in E} w_e}{n/2} - \underbrace{\left(\frac{\sum_{e \in E(v)} w_e}{n - 1} - \frac{\sum_{e \in E} w_e}{n(n - 1)/2}\right)}_{\ge 0} \le \frac{\sum_{e \in E} w_e}{n/2}.$$

Hence, the average weight of the edges in the minimum 
spanning tree is $\mathrm{MST}_n/(n - 1) \le \sum_{e \in E} w_e/(n(n - 1)/2)$. ■ 

To obtain a lower bound on the cost of the optimal 
solution, we consider the optimal partition and we build 
a spanning tree by first finding a minimum spanning tree 
in each cluster and then replacing each cluster by a single 
node and connecting each pair of these nodes by the lightest
edge linking the initial clusters. An intercluster minimum
spanning tree is then found. The intracluster spanning 
trees along with the intercluster spanning tree form a
spanning tree for the entire graph. 

Lemma 2. The list of edge weights of the intercluster 
minimum spanning tree (ICMST) is included in the list of 
edge weights of the global minimum spanning tree (GMST). 

Proof. Let $e$ be an edge in the ICMST that does not 
appear in the GMST. Let $u$ and $v$ be its endpoints in the 
original graph and let $w$ be its weight. The path in the 
GMST from $u$ to $v$ induces a path in the intercluster graph 
from the cluster of $u$ to that of $v$. If the path is a single 
edge then this edge must have weight $w$ and could replace 
the edge $e$ in the ICMST. If the induced path has more 
than one edge then, since the ICMST cannot contain a 
cycle, some of the edges on the induced path must not 
appear in the ICMST. At least one of these induced edges 
that do not appear in the ICMST forms a cycle containing 
$e$ when added to the ICMST. Let $e'$ be such an edge; $e'$ 
must have weight at most $w$, otherwise it could be replaced 
in the GMST by $(u, v)$ to obtain a spanning tree with a 
smaller cost. In addition, $e'$ cannot have weight less than 
$w$ because it would then be possible to replace $e$ by $e'$ in 
the ICMST and obtain a smaller intercluster spanning tree. 
Hence the weight of $e'$ is $w$ and we could remove $e$ and 
replace it with $e'$ in the ICMST. This process can be
repeated until all edges in the ICMST also appear in the 
GMST. ■ 

The following theorem establishes a bound on the cost 
of the heuristic based on finding a minimum spanning tree 
in the graph and then using Partition-Tree to find a
partitioning into clusters. 

Theorem 4. The cost (HEU) of the approximate solution
found using a minimum spanning tree and the cost 
(OPT) of the optimal solution satisfy the following
relationship: 

$$\frac{\mathrm{HEU}}{\mathrm{OPT}} \le N \frac{\mathrm{MST}}{\mathrm{MST} - (m - 1)\bar{w}},$$

where MST is the total weight of the edges in the minimum 
spanning tree and $\bar{w}$ is the average weight of the $m - 1$ 
heaviest edges in the minimum spanning tree. 

Proof. In evaluating an upper bound on the cost of the
approximate solution, we follow the same procedure as in
the proof of Theorem 3, but we do not distinguish between
leaf edges and internal edges. Each edge e in the tree will
be used by at most two clusters, and the contribution of
e to the overall cost is bounded by
$2 \times w_e \times \max_{1 \le i \le N-1} 2i(N-i) \le N^2 w_e$.
Hence, we have $HEU \le N^2 \, MST$.

Let $MST_i$ be the weight of the minimum spanning tree
of cluster i for $1 \le i \le m$ and $MST_C$ be the weight of the
intercluster tree. The intracluster minimum spanning trees
and the ICMST form a spanning tree in the original graph.
The total weight of the edges in that spanning tree is at
least MST: $\sum_{i=1}^{m} MST_i + MST_C \ge MST$. By Lemma 2, the
weight of every edge in the ICMST also appears among the
edge weights of the GMST. Hence $MST_C \le (m-1)\bar{w}$. This
yields $\sum_{i=1}^{m} MST_i + (m-1)\bar{w} \ge MST$.
Let $OPT_i$ be the contribution to the optimal cost by
cluster i. The average cost of the edges in cluster i is
$OPT_i/(N(N-1))$ and the average cost of the edges in the
corresponding spanning tree is $MST_i/(N-1)$. Applying
Lemma 1, we have $OPT_i/N \ge MST_i$, therefore
$OPT \ge N(MST - (m-1)\bar{w})$. ■

FIG. 5. Evaluation of the heuristics for the refined model.

Let r be the ratio of the largest edge weight to the 
smallest edge weight. A looser but simpler bound than the 
one established in Theorem 4 can be derived using the 
parameter r. To do so, we first rewrite the bound as follows: 


\[
\frac{HEU}{OPT} \le N \, \frac{MST}{MST - (m-1)\bar{w}}
= N \left( 1 + \frac{(m-1)\bar{w}}{MST - (m-1)\bar{w}} \right)
= N \left( 1 + \frac{(m-1)\bar{w}}{(n-m)\bar{\omega}} \right),
\]

where $\bar{\omega}$ is the average weight of the n - m lightest edges
in the GMST. Since $\bar{w}/\bar{\omega} \le r$ and n = mN, we have

\[
\frac{HEU}{OPT} \le N \left( 1 + r \, \frac{m-1}{n-m} \right)
\le N \left( 1 + \frac{r}{N-1} \right).
\]
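
As a worked illustration, the three forms of the bound can be evaluated from the edge weights of a minimum spanning tree. The Python sketch below is not part of the original analysis; the function name theorem4_bounds and its argument layout are hypothetical, and r is taken as given by the caller.

```python
def theorem4_bounds(mst_weights, m, N, r):
    """Return (tight, intermediate, loose) upper bounds on HEU/OPT.

    mst_weights: weights of the n - 1 edges of the global MST
    m, N:        number of clusters and sites per cluster (n = m * N)
    r:           ratio of the largest to the smallest edge weight
    """
    n = m * N
    if len(mst_weights) != n - 1:
        raise ValueError("a spanning tree on n = mN sites has n - 1 edges")
    ws = sorted(mst_weights)
    mst = sum(ws)
    heavy = sum(ws[n - m:])      # the m - 1 heaviest edges: (m - 1) * w_bar
    light = mst - heavy          # the n - m lightest edges: (n - m) * omega_bar
    tight = N * mst / light                      # bound of Theorem 4
    intermediate = N * (1 + r * (m - 1) / (n - m))
    loose = N * (1 + r / (N - 1))
    return tight, intermediate, loose
```

For example, with MST edge weights 1, 2, 2, 3, 5, m = 2, N = 3, and r = 5, the three values are 4.875, 6.75, and 10.5, which shows how much is given up by the simpler expression.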

6. GENERALIZATION AND APPLICATION OF 
THE MODEL 

6.1. Non-Uniform Load within Site

In our model, we assumed that each site sends parity
updates to each other site in its partition at the same rate.
This implies a uniform update rate to each of the N - 1
data groups of a given site that have parity information
on each of the N - 1 other sites. If the update rate
information for each data group at each site is available, then
the model can be refined to account for the difference in the
rate of parity update requests issued by a given site and
destined to the other sites in the array. The refined model
should yield better results in the presence of static hot
spots. The update rate $\lambda_u$ of site u is replaced by N - 1
update rates $\lambda_{u,1}, \ldots, \lambda_{u,N-1}$ corresponding to each of
its data groups. In this case, an obvious optimization would
be to have the parity of the ith most frequently accessed
data group of a given site placed on the ith nearest site in
its partition. We call this optimization LocalOpt. Note that
LocalOpt can be implemented without having to reshuffle
the data on disk by saving the permutation describing the
remapping of the N - 1 data groups for each site and using
it to route parity update requests to the proper site. Given
the above optimization, the algorithms of Section 4 with
some minor modifications can still be used to partition the
sites. The site update rate used in Algorithms 1 and 2 is set
to the sum of all N - 1 data group update rates at that
site. We have evaluated the three algorithms of Section 4
in the case of the refined model, along with a new greedy
strategy that looks at data groups instead of sites and tries
to place the parity of the data groups with the largest
update rates on the closest sites. Details of the greedy
algorithm are provided in the Appendix.
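
LocalOpt amounts to sorting the data groups of a site by update rate and the other sites of its partition by distance, then pairing them in order. The following Python sketch illustrates this for a single site. The names local_opt and parity_cost are hypothetical, and the cost is assumed here to be the sum of update rate times distance over all data groups.

```python
def local_opt(group_rates, distances):
    """Pair the ith hottest data group with the ith nearest site.

    group_rates: dict data group -> update rate (for one site)
    distances:   dict other site in the partition -> distance to it
    Returns the remapping as a dict data group -> parity site.
    """
    groups = sorted(group_rates, key=group_rates.get, reverse=True)  # hottest first
    sites = sorted(distances, key=distances.get)                     # nearest first
    if len(groups) != len(sites):
        raise ValueError("need exactly one parity site per data group")
    return dict(zip(groups, sites))

def parity_cost(assignment, group_rates, distances):
    """Assumed cost model: sum over data groups of rate * distance."""
    return sum(group_rates[g] * distances[s] for g, s in assignment.items())
```

The returned dictionary plays the role of the saved permutation: parity update requests for a data group are routed to the site it maps to, and no data needs to move on disk.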

Figure 5 shows the results of the comparison between the
four algorithms. The results shown assume that LocalOpt is
performed. The individual data group update rates are
chosen randomly from the interval $[1, K_\lambda]$ while the edge
weights are chosen from $[1, K_w]$. We found that Algorithms
2 and 3 perform best for N = 10, with Algorithm 2 being
the winner for lower values of n while Algorithm 3 is better
for the high values of n. For N = 5, Algorithm 3 performs
best in almost all situations. The reason that Algorithm 3
performs better for N = 5 in this case compared with the
uniform load case (Fig. 3) can be explained by the fact
that the site loads have smaller variance because they are
the sum of N - 1 rates drawn from the uniform distribution
over $[1, K_\lambda]$. The performance of the greedy algorithm
indicates that basing assignment decisions on individual
data group loads produces poorer results than using total
site loads.

FIG. 6. Advantage of optimizing parity placement within a cluster
using LocalOpt ($N = 10$, $K_w = 100$, and $K_\lambda = 100$).

We also found that the parity assignment within a cluster 
is as important as the problem of partitioning the sites into 
clusters. Using LocalOpt reduces the cost of the solution 
by 15 to 20%. This is shown in Fig. 6 for the case N = 10,
$K_w = 100$, and $K_\lambda = 100$. Similar results were obtained for
the other environments. 

6.2. Non-Uniform Site Capacity 

The case of nonuniform site capacity can be handled in
the same fashion as proposed by Stonebraker and Schloss
[3]. We assume that the total number of disks is $N\beta$ for
some integer³ $\beta$ and that the number of disks at any given
site is at most $\beta$. The system could then be partitioned
using the following procedure.

Step 1. Select the $N \lfloor |V|/N \rfloor$ sites with the largest
number of disks and apply one of the partitioning algorithms
described in the previous sections to assign one disk from
each of the selected sites to an array.

Step 2. Remove the assigned disks and remove sites 
with no disks left. 

Step 3. Repeat the above steps until all disks have 
been assigned. 
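
The three steps can be sketched as a loop. The Python code below is a minimal illustration under stated assumptions: assign_disks and chunk are hypothetical names, chunk is only a placeholder for one of the partitioning algorithms of Section 4, and ties between sites with the same number of disks are broken by site name.

```python
def assign_disks(disks_per_site, N, partition):
    """Build arrays of N disks located at N distinct sites.

    disks_per_site: dict site -> number of disks at that site
    partition:      function (sites, N) -> list of groups of N sites,
                    a stand-in for the algorithms of Section 4
    """
    remaining = {s: d for s, d in disks_per_site.items() if d > 0}
    arrays = []
    while remaining:                 # Step 3: repeat until no disks are left
        if len(remaining) < N:
            raise ValueError("fewer than N sites still hold disks")
        # Step 1: keep the N * floor(|V| / N) sites with the most disks.
        count = N * (len(remaining) // N)
        chosen = sorted(remaining, key=lambda s: (-remaining[s], s))[:count]
        arrays.extend(partition(chosen, N))
        # Step 2: one disk per chosen site is used; drop exhausted sites.
        for s in chosen:
            remaining[s] -= 1
            if remaining[s] == 0:
                del remaining[s]
    return arrays

def chunk(sites, N):
    """Placeholder partitioner: consecutive groups of N sites."""
    return [sites[i:i + N] for i in range(0, len(sites), N)]
```

With N = 2 and sites A, B, C holding 2, 1, and 1 disks, the first round pairs A with B and the second round pairs the remaining disk of A with C.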

Nonuniform disk capacity can be dealt with by using 
logical disks of size B blocks such that the site capacities 
are multiples of B [3]. 

³ This replaces the assumption that |V| = mN.

6.3. Disaster Recovery in OLTP Systems

Disaster recovery is an important issue in on-line
transaction processing (OLTP) systems [14-16]. However, in
such systems, updating the remote parity after each disk
update may be too expensive, especially since these
systems usually have stringent requirements on transaction
response times.

Typically, disaster recovery in OLTP systems is imple- 
mented by duplicating the data of a given site at a remote 
backup site and shipping redo log information to the 
backup site where the updates are applied to the backup 
database. There are two approaches used in shipping the 
log [17]. In the first approach, the log records are shipped 
asynchronously to the backup site. Therefore transaction 
response time is not affected by the communication with 
the backup. However some transactions may be lost in the 
case of a disaster. This configuration is called 1-safe. In
the second approach, log records are sent to the backup 
at commit time and the transaction waits for an acknowl- 
edgment before it is allowed to commit. No transactions 
are lost in this case. This configuration is called 2-safe. 

Similar configurations can be implemented using 
RADD. In a 1-safe implementation, parity updates (XORs 
of old and new data) can be accumulated at the originating 
site and shipped to the remote parity locations periodically. 
In a 2-safe implementation, the parity updates originated
by a transaction are grouped according to their destination
site and shipped to that site while the transaction waits
for an acknowledgment. If the updates performed by the
transaction involve only one of the N - 1 data groups then
only one remote message has to be sent by the committing 
transaction and the delay will be the same as in the tradi- 
tional remote backup scheme. The advantage of RADD 
over the traditional schemes is that it uses much less storage 
space than full duplication. 
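
The accumulation of parity updates at the originating site can be sketched as follows. This Python example is illustrative only: the class ParityBuffer, its methods, and the use of a block identifier as key are assumptions, not part of the RADD proposal. It relies on the fact that XOR is associative, so successive updates to the same block can be folded into one delta before shipping.

```python
from collections import defaultdict

def xor(a, b):
    """Bytewise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

class ParityBuffer:
    """Accumulates parity updates (old XOR new) per destination site."""

    def __init__(self):
        self.pending = defaultdict(dict)   # site -> {block id: delta}

    def record(self, dest, block, old, new):
        delta = xor(old, new)
        prev = self.pending[dest].get(block)
        if prev is not None:               # fold into the queued delta
            delta = xor(prev, delta)
        self.pending[dest][block] = delta

    def flush(self):
        """Return one batch per destination site and empty the buffer."""
        batches = dict(self.pending)
        self.pending = defaultdict(dict)
        return batches

def apply_delta(parity, delta):
    """Applied at the remote parity site for each shipped delta."""
    return xor(parity, delta)
```

In a 1-safe configuration flush would be called periodically; in a 2-safe configuration it would be called at commit time, with the transaction waiting for an acknowledgment of each batch.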

Our model can still be used to solve the site assignment 
problem in both of the above implementations. However, 
instead of using the update rate at each site, the frequency 
of the periodic updates should be used in the 1-safe case
and the update transaction rate should be used in the 2- 
safe case. 

Another optimization that might be useful in OLTP 
environments consists of using the scheme proposed by 
Bhide and Dias in [18] to reduce the number of random 
I/Os performed in updating the parity at the remote site.
The scheme consists of storing the parity updates in non- 
volatile memory or sequentially on a dedicated disk and 
then periodically propagating them to their permanent lo- 
cations. The scheme was originally proposed for use with 
a RAID level 4 organization [1] to reduce the load on the 
parity disk. When the parity updates are stored sequen- 
tially on a dedicated disk, disk sorting is used to apply the 
parity updates to their permanent location. 

6.4. Applying the Algorithms 

Another important question is when and how often to 
apply the algorithm in order to obtain a lower cost site 
assignment. Clearly the algorithms can be used when the 
RADD scheme is first implemented as long as information 
on site loads is available. As these loads change, the perfor- 
mance of the system degrades and the site assignment may 


need to be modified. Changing the site assignment is a 
costly operation. It involves reading large amounts of data 
to recompute the new parity and then updating the parity. 
This operation should be performed when the following 
two conditions are met: (1) the difference between the cost 
of the current assignment and the cost of the best solution 
found by the algorithms should be large enough, and (2) the 
parameters of the system (site loads) should be relatively 
stable so that the benefits of the new site assignment last 
long enough to offset the cost of performing the reas- 
signment. 

The cost of reassignment can be reduced if some clusters 
are kept unchanged. Hence one might be better off choos- 
ing a solution that is not the best possible but that preserves 
most of the current clustering. Procedure Improve de- 
scribed in Section 4 can be used to perform a limited 
number of swaps that decrease the cost of updating the 
parity without a full scale reassignment. 

7. SUMMARY 

We looked at the problem of partitioning the sites of a 
distributed storage system into redundant disk arrays while 
minimizing the communication costs for updating the par- 
ity information. The problem was shown to be NP-hard in 
its general form. Several heuristic methods were investi- 
gated to obtain approximate solutions to the site parti- 
tioning problem. It was found that the heuristic that mini- 
mizes the sum of distances between sites within each cluster 
(Algorithm 3) performs consistently well in all environ- 
ments, especially in large systems with a relatively large 
array size. In such systems, the above approach outper- 
forms greedy methods that attempt to satisfy first the sites 
with the largest loads by placing their nearest neighbors 
in their partition. The solutions produced by Algorithm 3 
are also more robust because they provide good perfor- 
mance under different site loads. Guaranteed upper 
bounds were established on the deviation from the optimal 
cost for some of the heuristics. It was also found that 
modifying the parity assignment within each cluster to 
place the parity of the heavily accessed data groups on the 
nearest sites within the cluster can significantly decrease 
the parity update cost. Finally, we discussed implementa- 
tions of the RADD scheme for disaster recovery in OLTP 
systems and described various optimizations that can be 
helpful in those environments. 

APPENDIX 

Algorithm Greedy 

Let $\Lambda$ be the list of update rates for all data groups at
all sites.

Let $p_v$ be the number of site v's partition. Initially
$p_v = -1$ for all $v \in V$.

Let $n_i$ be the number of sites in partition i. Initially,
$n_i = 0$. Assume $n_{-1} = 1$ throughout.

Let k be the current number of partitions. Initially k = 0.

Let $\mathcal{A}(v) = V - v$, for all $v \in V$.

Let l = 0.

Step 1. Select the largest value $\lambda$ in $\Lambda$ and let u be the
corresponding site. If $n_{p_u} = N$ go to Step 4.

Step 2. Find the site v in $\mathcal{A}(u)$ that is nearest to u and
satisfies ($p_u \ne -1$ or $p_v \ne -1$) and $n_{p_u} + n_{p_v} \le N$, or
$p_u = p_v = -1$ and $k < m$. If none exist go to Step 4.

Step 3. Remove v from $\mathcal{A}(u)$.

If $p_u = p_v = -1$ set $p_u = p_v = l$, $n_l = 2$, $l = l + 1$, and
$k = k + 1$.

If $p_u = -1$ and $p_v \ne -1$ set $p_u = p_v$ and $n_{p_v} = n_{p_v} + 1$.

If $p_u \ne -1$ and $p_v = -1$ set $p_v = p_u$ and $n_{p_u} = n_{p_u} + 1$.

If $p_u \ne -1$ and $p_v \ne -1$, set the partition number for
every site in v's current partition to $p_u$, set
$n_{p_u} = n_{p_u} + n_{p_v}$, $n_{p_v} = 0$, and $k = k - 1$.

Step 4. Remove $\lambda$ from $\Lambda$.

Step 5. If $\sum_i n_i < n$, go to Step 1, otherwise stop.

The algorithm is similar to Algorithm 1 in that it tries 
to satisfy first the nodes with the highest data group update 
rates. The complexity of the algorithm is $O(Nn^2)$, but as
in the case of Algorithm 1, it requires the all-pairs shortest
path algorithm.
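
The steps of Algorithm Greedy can be sketched in Python as follows. This is a hedged illustration rather than the implementation that was evaluated: the function name greedy_partition and its data layout are hypothetical, distances are assumed to be precomputed, and the sketch adds a guard (not stated in the steps) that skips a candidate already in the same partition as u. It also stops when the list of rates is exhausted, in which case some sites may remain unassigned (partition number -1).

```python
def greedy_partition(rates, dist, N, m):
    """rates[u]: list of data group update rates of site u.
    dist[u][v]: distance between sites u and v.
    Returns a dict site -> partition number (-1 if unassigned)."""
    sites = list(rates)
    part = {v: -1 for v in sites}            # p_v
    size = {-1: 1}                           # n_i, with n_{-1} = 1
    cand = {u: sorted((v for v in sites if v != u), key=lambda v: dist[u][v])
            for u in sites}                  # candidates, nearest first
    queue = sorted(((lam, u) for u in sites for lam in rates[u]),
                   key=lambda t: t[0], reverse=True)
    k = next_id = assigned = 0
    for _, u in queue:                       # Steps 1 and 4: largest rate first
        if assigned == len(sites):           # Step 5: every site is placed
            break
        if size[part[u]] == N:               # partition of u is already full
            continue
        v = None
        for x in cand[u]:                    # Step 2: nearest feasible site
            pu, px = part[u], part[x]
            if pu == -1 and px == -1:
                feasible = k < m
            else:                            # pu != px is the added guard
                feasible = pu != px and size[pu] + size[px] <= N
            if feasible:
                v = x
                break
        if v is None:
            continue
        cand[u].remove(v)                    # Step 3
        pu, pv = part[u], part[v]
        if pu == -1 and pv == -1:            # open a new partition
            part[u] = part[v] = next_id
            size[next_id] = 2
            next_id += 1
            k += 1
            assigned += 2
        elif pu == -1:                       # u joins v's partition
            part[u] = pv
            size[pv] += 1
            assigned += 1
        elif pv == -1:                       # v joins u's partition
            part[v] = pu
            size[pu] += 1
            assigned += 1
        else:                                # merge v's partition into u's
            for w in sites:
                if part[w] == pv:
                    part[w] = pu
            size[pu] += size[pv]
            size[pv] = 0
            k -= 1
    return part
```

For four sites placed at positions 0, 1, 10, and 11 on a line, with N = 2 and m = 2, the sketch pairs the two sites on the left and the two on the right.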